This barchart represents the number of taxi trips made in Chicago per year. We see that the in 2014 the frequency of taxis was at its peak and is decreasing since.

Hint of seasonality

Through the plot below we can study the monthly behaviour of taxi frequencies for each year. We notice that between 2013 and 2019 the taxi trips frequency follow a similar behaviour, this may suggest that seasonality may have an influence on the taxi trips frequency. For example, every March, May and October the taxi frequency increased and the fequency seems to decrease for the months of June, July and November. However, We notice that years 2020 and 2021 seem to have a different pattern than the rest of the months.

In order to see the monthly effect on the taxi trips frequency we can plot the frequency through a heatmap and see its changes over the years and months.

In this plot, the high frequency is visualized by a a dark colour and the low frequency by a bright colour. In this plot, we can compare the frequency by year and by month. We notice that there may be a monthly (seasonal) effect on the taxi frequency, for example we see that in March and October the pattern seems to be darker than for the rest of the months.

Time series analysis

## [1] 106   2

To explore in more details the relationship between time and taxi frequency we decide to analyze the taxi trips frequency as time series. Without any estimation we notice that there seems to be a decreasing trend after 2014.

p <- ggplot(by_y_m, aes(x = ym, y = count)) +
  geom_line()  +  geom_smooth()+ theme_bw() + theme(axis.text.x = element_text(angle = 45, vjust = 0.1, hjust=0.5)) +  scale_y_continuous(labels =comma) + xlab("Date") + ylab("Frequency") + labs(title = "Frequency of taxis between 2013 and 2021")
  

ggplotly(p)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Analysis by district

To analyze further the change in behaviour of the taxi trips frequency in Chicago we may group our data by districts. In the original dataset we have the information about the community areas, but since the number of areas is very high (77 areas) and it is difficult to visualize frequencies of 77 categories in one plot, we suggest to group our data using another geographical feauture: districts. According to Chicago 77 (http://www.thechicago77.com/chicago-neighborhoods/), we can regroup the areas into 9 districts. We now get a new variable: district which is the district in which the taxi made a pick up.

By plotting the frequencies of taxi trips through time for each district in one plot, we can assess the differences in the decrease of the frequency of taxi trips in each district and compare them. We notice that the districts which originally had a very small number of taxi trips (for example, Central, Near North, and Near South Side) didn’t had such a strong decrease of frequency as the districts that had higher values between 2013 and 2015 (for example, Far North Side).

In this part we suggest to consider that there may be a relation for the taxi frequency through time and propose to perform a time series analysis to estimate the trend, the seasonal effects and see whether the observations are correlated through time and can be represented as a particular time series model.

We first estimate a linear model with a liner trend and dummy variables for months (where we omit the month of January) as covariates. The output is the taxi trips frequency.

## MODEL INFO:
## Observations: 106
## Dependent Variable: number_of_taxis
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(12,93) = 18.7396, p = 0.0000
## R² = 0.7074
## Adj. R² = 0.6697 
## 
## Standard errors: OLS
## ------------------------------------------------------------------
##                             Est.          S.E.     t val.        p
## ----------------- -------------- ------------- ---------- --------
## (Intercept)         2622312.0927   178247.0026    14.7117   0.0000
## time                 -23006.6277     1542.9894   -14.9104   0.0000
## monthFeb              55434.5166   228284.2278     0.2428   0.8087
## monthMar             274823.8110   228299.8710     1.2038   0.2317
## monthApr             179397.8832   228325.9407     0.7857   0.4340
## monthMay             326897.9554   228362.4332     1.4315   0.1556
## monthJun             362441.8054   228409.3436     1.5868   0.1160
## monthJul             266674.3220   228466.6654     1.1672   0.2461
## monthAug             319552.8386   228534.3909     1.3983   0.1654
## monthSep             286716.3553   228612.5108     1.2542   0.2129
## monthOct             401104.8719   228701.0143     1.7538   0.0828
## monthNov             277457.6776   235385.5510     1.1787   0.2415
## monthDec             266354.1804   235431.0621     1.1313   0.2608
## ------------------------------------------------------------------

The variable time (which represents the trend) is negative and significant at 1% level, meaning that overall, we observe a decreasing trend through time. During the sample period, the number of taxi trips in Chicago decreased by an average of 23006 trips per month. The dummy variables representing the months (the seasonal components) are not significant and therefore do not suggest a difference in frequency of taxi trips depending on month in this model.

We consider to fit a second model as it may be more appropriate to consider the trend as quadratic. From the previous graphics, we notice that before 2014 the number of taxi trips seemed to increase and after 2014 the number of taxi trips decreased.

## MODEL INFO:
## Observations: 106
## Dependent Variable: number_of_taxis
## Type: OLS linear regression 
## 
## MODEL FIT:
## F(13,92) = 55.4318, p = 0.0000
## R² = 0.8868
## Adj. R² = 0.8708 
## 
## Standard errors: OLS
## ------------------------------------------------------------------
##                             Est.          S.E.     t val.        p
## ----------------- -------------- ------------- ---------- --------
## (Intercept)         1820730.0989   129757.7391    14.0318   0.0000
## time                  22566.8437     3896.3921     5.7917   0.0000
## I(time^2)              -425.9203       35.2803   -12.0725   0.0000
## monthFeb              52027.1543   142778.5245     0.3644   0.7164
## monthMar             268860.9269   142788.8837     1.8829   0.0629
## monthApr             171731.3179   142805.7465     1.2026   0.2322
## monthMay             318379.5495   142828.9013     2.2291   0.0282
## monthJun             353923.3995   142858.2406     2.4774   0.0151
## monthJul             259007.7567   142893.7606     1.8126   0.0732
## monthAug             313589.9545   142935.5611     2.1939   0.0308
## monthSep             283308.9929   142983.8456     1.9814   0.0505
## monthOct             401104.8719   143038.9208     2.8042   0.0062
## monthNov             182051.5317   147431.6610     1.2348   0.2200
## monthDec             170948.0345   147460.0845     1.1593   0.2493
## ------------------------------------------------------------------

In this second model, the trend and the trend squared are both significant at 1% level, as well as some of the dummy variables representing months: May, June, Augusr, September and October, meaning that there is a significant difference in taxi trips frequency in these month compared to the month of January. The R squared in the second model is higher than in the first model, meaning that the the model with the squared trend fits better the observed data.

Residual analysis

To verify the assumptions of the linear model (the residuals are normally distribued with a mean equal to 0 and a constant variance), we perform a residual analysis. In the first plot representing the residuals versus fitted values, we notice that the average mean of the residuals is close to 0. When the fitted values are small and high, the residuals are spread, in the middle of the plot the residuals seem to be dense. The spread is not constant, and we may suspect a heteroskedasticity issue in our model. Now looking at the qq plot we see that the points fall alonf the line in the middle of the graph, but not at the extremeties. This suggests that we may have too many extreme values.

In this part we represent the autocorrelation function and the partial autocorrelation functions which provide information about the dependence structure the time series may have. From the plots we may have a hint that we are in presence of a autoregressive model of order 1.

## 
## Call:
## arima(x = residuals, order = c(2, 0, 0), n.cond = 3)
## 
## Coefficients:
##          ar1      ar2  intercept
##       1.0770  -0.2058  -21593.73
## s.e.  0.0991   0.1026   90775.23
## 
## sigma^2 estimated as 1.577e+10:  log likelihood = -1395.78,  aic = 2799.55